Loading necessary libraries

library(ggpubr)
library(gprofiler2)
library(tidyverse)
library(readxl)

Check your working directory

  • Paths provided within this .rmd assume you are working within the Hackathon19_visualise_ontology project directory. Provided you have opened Hackathon19_visualise_ontology as a project this should already be the case.
  • You can check by doing the following:
getwd()
## [1] "/home/rreynolds/misc_projects/Hackathon19_visualise_ontology"

Running gProfiler2

  • Input genes are sourced from the latest PD GWAS.
  • We will be using two lists:
    • Nearest nominated gene
    • QTL nominated gene
# import PD GWAS table
PD_res <-
  readxl::read_excel(path = "./data/TableS2_Detailed_summary_statistics_nominated_risk_variants.xlsx")

# Creating a named list
geneLists <- 
  PD_res %>% 
  # Select two columns we wish to turn into lists
  dplyr::select(`Nearest Gene`, `QTL Nominated Gene (nearest QTL)`) %>% 
  # Coerce columns into list format
  as.list() %>% 
  # Across each list do the following...
  lapply(., function(x){
    x %>% 
      # Remove NAs
      na.omit() %>%
      # Convert to character vector
      as.character() %>% 
      # Keep only unique genes
      unique()
  }) %>% 
  # Name the lists
  setNames(., c("nearest_gene", "QTL_nom_gene"))


# running gprofiler
gprofilerOutput <- gprofiler2::gost(query = geneLists, # if lists are named then gost can use this an input
                               organism = "hsapiens", # set the organism 
                               correction_method = c("gSCS"), # select a correction method (gSCS reccomended in gost documentation)
                               domain_scope = c("annotated"), # select whether to restrict the background set
                               sources = c("GO"), # databases would you like to search for pathways & processes (others: "KEGG","REAC","WP")
                               significant = TRUE, # when TRUE displays only significant results (p<0.05)
                               evcodes = FALSE, # when TRUE displays the IDs of the genes calling each pathway
                               ordered_query = FALSE) # when TRUE the input list is ordered in some meaningful way
  • The background list (domain_scope argument) is an important choice, as it should represent the population of genes your list was sampled from. E.g. If you were to perform a differential gene expression experiment in brain samples from control and PD, your background list would be all the expressed genes you tested.

What do the results mean?

gprofilerOutput$result %>% 
  DT::datatable(rownames = FALSE,
                options = list(scrollX = TRUE),
                class = 'white-space: nowrap')
  • Notice that one of the columns is query, and within this we find the names of our inputted lists i.e. nearest_gene, QTL_nom_gene
  • gprofiler outputs a number of columns, a few of which are not entirely self-explanatory, including:
    • precision: this is the proportion of genes in the input gene list that can be assigned this term i.e. overlap.size/query.size
    • recall: this is the proportion of genes in the input list that overlap with the term i.e. overlap.size/term.size
    • parents: GO terms belong to families of terms. This column will provide the parent terms to the GO term found enriched.
  • Also, as some of you may notice there are some duplicated terms between our two lists, which are shown in the table below.
gprofilerOutput$result %>% 
  # Filter for duplicated terms
  dplyr::filter(duplicated(term_id) | duplicated(term_id, fromLast = TRUE)) %>% 
  # Rearrange column order to have term_id and term_name at beginning
  dplyr::select(term_id,term_name, everything()) %>% 
  # Arrange rows by term_id
  dplyr::arrange(term_id) %>% 
  DT::datatable(rownames = FALSE,
                options = list(scrollX = TRUE),
                class = 'white-space: nowrap')

How to visualise this information?

  • There are many ways to visualise these results, with a few examples given below.
  • How do you think this information would be best displayed? This is what we’re going to do over the next few days.

Visualising the results as a barplot

  • Remember we queried two lists, so we will want to plot this twice
source("./R/gen_gprofiler_barplot.R")

plot_list <- 
  gprofilerOutput$result %>% 
  # Group by query
  dplyr::group_by(query) %>% 
  # Split dataframe by query groups i.e. nearest_gene and QTL_nom_gene
  dplyr::group_split() %>%
  # For each dataframe plot a barplot
  lapply(., function(x){
    gen_gprofiler_barplot(DF = x,
                          num_categories = 15,
                          plot_title = "")
  })

# ggarrange allows us to arrange multiple ggplots on the same page
ggpubr::ggarrange(plotlist = plot_list,
                  labels = c("nearest_gene", "QTL_nom_gene"),
                  nrow = 2)  

Visualising the results with gprofiler’s in-built visualisation

gprofiler2::gostplot(gprofilerOutput) 

Visualising as a network?

  • Nodes would represent GO terms and edges the genes shared between these GO terms
  • Potential packages:

Session info

  • It is considered good practice to always your session info in the html output of an R markdown.
  • Session info will depend on the RStudio you are using, your operating system and the packages you load. Thus, when you knit the final product of your code, it may not look like it does in this version of the html output.
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] plyr_1.8.4        data.table_1.12.8 gridExtra_2.3    
##  [4] scales_1.0.0      readxl_1.3.1      forcats_0.4.0    
##  [7] stringr_1.4.0     dplyr_0.8.3       purrr_0.3.3      
## [10] readr_1.3.1       tidyr_1.0.0       tibble_2.1.3     
## [13] tidyverse_1.2.1   gprofiler2_0.1.7  ggpubr_0.2.4     
## [16] magrittr_1.5      ggplot2_3.2.1    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.3        lubridate_1.7.4   lattice_0.20-38  
##  [4] assertthat_0.2.1  zeallot_0.1.0     digest_0.6.23    
##  [7] mime_0.7          R6_2.4.1          cellranger_1.1.0 
## [10] backports_1.1.5   evaluate_0.14     httr_1.4.1       
## [13] pillar_1.4.2      rlang_0.4.2       lazyeval_0.2.2   
## [16] rstudioapi_0.10   DT_0.9            rmarkdown_1.18   
## [19] labeling_0.3      htmlwidgets_1.5.1 RCurl_1.95-4.12  
## [22] munsell_0.5.0     shiny_1.3.2       broom_0.5.2      
## [25] compiler_3.6.1    httpuv_1.5.2      modelr_0.1.5     
## [28] xfun_0.11         pkgconfig_2.0.3   htmltools_0.3.6  
## [31] tidyselect_0.2.5  fansi_0.4.0       viridisLite_0.3.0
## [34] crayon_1.3.4      withr_2.1.2       later_0.8.0      
## [37] bitops_1.0-6      grid_3.6.1        xtable_1.8-4     
## [40] nlme_3.1-141      jsonlite_1.6      gtable_0.3.0     
## [43] lifecycle_0.1.0   cli_2.0.0         stringi_1.4.3    
## [46] ggsignif_0.6.0    promises_1.0.1    xml2_1.2.2       
## [49] ellipsis_0.3.0    generics_0.0.2    vctrs_0.2.0      
## [52] cowplot_1.0.0     tools_3.6.1       glue_1.3.1       
## [55] hms_0.5.1         crosstalk_1.0.0   yaml_2.2.0       
## [58] colorspace_1.4-1  rvest_0.3.4       plotly_4.9.1     
## [61] knitr_1.25        haven_2.1.1